Add LTX-2.3 text-to-video generation support #402
Conversation
🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
This Pull Request successfully introduces support for LTX-2.3 text-to-video generation. It includes significant updates to the transformer architecture (gated attention, cross-modal modulation) and the denoising pipeline (4-way batched denoising for STG/CFG/MIG). The implementation is high-quality and integrates well with the existing LTX-2 infrastructure.
🔍 General Feedback
- Redundant Patch File: The `scratch_diff.patch` file was likely added by mistake and should be removed before merging.
- Robustness: A few areas in the pipeline (like the `audio_channels` fallback and upsampler parameter inference) could be made more robust to handle different model versions and naming conventions.
- Optimization: The use of `nnx.jit` for the vocoder and the optimized sequence length in smoke tests are excellent additions for performance and stability.
🤖 Hi @prishajain1, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
🤖 I'm sorry @prishajain1, but I was unable to process your request. Please see the logs for more details.
This Pull Request introduces comprehensive support for LTX-2.3 text-to-video generation, including the end-to-end pipeline, model updates, and a new vocoder with bandwidth extension (BWE). The implementation correctly handles complex features like Spatio-Temporal Guidance (STG) and Modality Isolation Guidance (MIG) using a 4-way batched denoising approach in JAX.
🔍 General Feedback
- STG/MIG Logic: The implementation of the 4-way split denoising logic and the corresponding delta formulations for guidance is impressive and aligns well with the LTX-2.3 technical requirements.
- Efficiency: Utilizing `nnx.scan` for the denoising loop ensures optimal performance on TPU/GPU hardware.
- Redundancy: I identified some redundant initializations and assignments in the transformer and autoencoder models that should be cleaned up.
- Parameter Initialization: Double-check the usage of `nnx.Param` with `kernel_init`, as `nnx.Param` typically only accepts the data tensor and might ignore additional keyword arguments.
Perseus14 left a comment
Left a few comments. PTAL
Additional Comments
- Could you test LTX2 in this branch and ensure that there is no regression?
- Please test with `scan_layers` true/false as well.
- Please add e2e generation time, as well as per-component timings, if possible.
This Pull Request introduces comprehensive support for the LTX-2.3 multi-modal (audio-video) transformer model. It includes key architectural updates such as Gated Cross-Modal Attention, Prompt AdaLN, and a sophisticated Bandwidth Extension (BWE) Vocoder. The implementation is technically sound, highly optimized for JAX/TPU, and follows the project's established modular patterns.
🔍 General Feedback
- 4-Way Batched Denoising: The integration of Spatiotemporal Guidance (STG) and Modality Isolation Guidance (MIG) via 4-way batching is a major highlight, enabling advanced generation features.
- Performance: Excellent use of JIT caching for the vocoder and conditional VAE replication to optimize inference latency.
- Code Quality: The transition to more explicit logic for guidance (using x0 space) improves both readability and correctness compared to standard velocity-based CFG.
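As a rough illustration of the 4-way batched guidance idea discussed above, here is a minimal sketch of recombining the four denoiser passes. The scale names and the delta formulation below are illustrative assumptions, not the exact LTX-2.3 equations or this PR's implementation:

```python
import jax.numpy as jnp


def combine_guidance(x0_batch, cfg_scale=7.0, stg_scale=1.0, mig_scale=1.0):
    """Combine a 4-way batched x0 prediction into one guided estimate.

    x0_batch stacks the four denoiser passes along the leading axis:
    [uncond, cond, perturb, isolated]. The delta formulation here is
    illustrative, not the exact LTX-2.3 guidance equations.
    """
    uncond, cond, perturb, isolated = jnp.split(x0_batch, 4, axis=0)
    guided = (
        uncond
        + cfg_scale * (cond - uncond)    # classifier-free guidance delta
        + stg_scale * (cond - perturb)   # spatiotemporal guidance delta
        + mig_scale * (cond - isolated)  # modality-isolation guidance delta
    )
    return guided


# batch shape: (4, 2, 3); each leading slice is one guidance branch
batch = jnp.stack([jnp.full((2, 3), v) for v in (0.0, 1.0, 0.5, 0.8)])
out = combine_guidance(batch)
print(out.shape)  # (1, 2, 3)
```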
Added the above details for each component to the PR description, along with details of what has been tested.
The unit test failure is unrelated to the changes in this PR.
This PR introduces end-to-end pipeline and model changes to support the LTX-2.3 multi-modal (audio-video) transformer model. It enables integrated text-to-audio-video generation using Gemma-based text conditioning, latent upsamplers, and vocoders.
Key architectural changes
- Gated attention (`to_gate_logits`) applied to all attention operations in the block (Self-Video, Self-Audio, Prompt-Cross, and Modal-Cross).
- Prompt AdaLN (`self.prompt_adaln`): for this specific cross-attention modulation, it derives scale and shift parameters directly from the continuous noise level (sigma).
- Per-modality projections (`per_modality_projections=True`): instead of a shared feature extractor, it applies per-token RMS normalization to the raw hidden states and passes them through two separate linear projection layers (`video_text_proj_in` and `audio_text_proj_in`) before sending them to the respective video and audio connectors.
- Bandwidth Extension (BWE) vocoder (`LTX2VocoderWithBWE`).

Files added/modified
- `ltx2_3_video.yml`: New config file for LTX2.3
- `vocoder_ltx2.py`: Added support for BWE vocoder
- `ltx2_pipeline.py`: Enabled 4-way sliced batched inference (Uncond, Cond, Perturb, Isolated) and integrated velocity/x0 conversion delta equations with guidance rescaling.
- `transformer_ltx2.py`: Propagated modality/perturbation masks to transformer blocks and integrated prompt adaptive layer norms.
- `generate_ltx2.py`, `pyconfig.py`, `common_types.py`: Added support for LTX2.3
- `ltx2_utils.py`: Added support to load new LTX2.3-specific weights
- `attention_ltx2.py`: Added support for gated attention and perturbed attention
- `autoencoder_kl_ltx2.py`: Added support for different `upsample_type`
- `embeddings_connector_ltx2.py`: Added gated attention (`gated_attn`) support to intermediate transformer block connectors.
- `feature_extractor_ltx2.py`: Added support for the `per_modality_projections` parameter
- `text_encoders.py`: Implemented dual-modality parallel text connector routing, token-wise RMS scaling, and independent video-audio linear projections.

Sample outputs
Component-wise breakdown
Tested:
- `scan_diffusion_loop = True` and `scan_diffusion_loop = False`
- `scan_layers = True` and `scan_layers = False`
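For context on the two tested modes, here is a toy sketch of how a `scan_diffusion_loop` toggle between `jax.lax.scan` and an unrolled Python loop can be structured. The `denoise_step` below is a hypothetical stand-in for the real transformer call, not this PR's implementation:

```python
import jax
import jax.numpy as jnp


def denoise_step(latents, sigma):
    """Toy denoising step; the real step would call the transformer."""
    return latents * (1.0 - 0.1 * sigma), None


def run_denoising(latents, sigmas, scan_diffusion_loop=True):
    """Run the diffusion loop either via jax.lax.scan (compiled as one
    loop primitive, faster to trace) or an unrolled Python loop (easier
    to debug). Both paths should produce the same result."""
    if scan_diffusion_loop:
        latents, _ = jax.lax.scan(denoise_step, latents, sigmas)
    else:
        for sigma in sigmas:
            latents, _ = denoise_step(latents, sigma)
    return latents


x = jnp.ones((2, 4))
sigmas = jnp.linspace(1.0, 0.0, 5)
a = run_denoising(x, sigmas, scan_diffusion_loop=True)
b = run_denoising(x, sigmas, scan_diffusion_loop=False)
print(jnp.allclose(a, b))
```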